An OS-Based Alternative to Full Hardware Coherence on Tiled Chip-Multiprocessors
Institute for Computing Systems Architecture
The interconnect mechanisms (shared bus or crossbar) used in current chip-multiprocessors
(CMPs) are expected to become a bottleneck that prevents these architectures from scaling to a
larger number of cores. Tiled CMPs offer better scalability by integrating relatively simple cores
with a lightweight point-to-point interconnect. However, such interconnects make snooping
impractical and, thus, require alternative solutions to cache coherence.
This thesis proposes a novel, cost-effective hardware mechanism to support shared-memory
parallel applications that forgoes hardware-maintained cache coherence. The proposed mechanism
is based on the key ideas that mapping of lines to physical caches is done at the page
level with OS support and that hardware supports remote cache accesses. It allows only some
controlled migration and replication of data and provides a sufficient degree of flexibility in the
mapping through an extra level of indirection between virtual pages and physical tiles.
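The page-level mapping with an extra level of indirection can be illustrated with a minimal sketch. This is not the thesis's implementation: the class `TileMapper`, the notion of an indirection "slot", the 4 KiB page size, and all method names are assumptions made purely for illustration. The sketch only shows how every line of a page could resolve to one home tile, and how remapping a slot would redirect its pages without touching the per-page entries.

```python
PAGE_SIZE = 4096  # assumed page size in bytes (hypothetical)


class TileMapper:
    """Hypothetical model: virtual page -> indirection slot -> physical tile."""

    def __init__(self, num_tiles):
        self.num_tiles = num_tiles
        # OS-managed table: virtual page number -> indirection slot.
        self.page_to_slot = {}
        # Indirection table: slot -> physical tile (identity mapping at start).
        self.slot_to_tile = list(range(num_tiles))

    def map_page(self, vpn, slot):
        """Assign a virtual page to an indirection slot (done by the OS)."""
        self.page_to_slot[vpn] = slot

    def remap_slot(self, slot, tile):
        """Redirect every page using this slot to a different physical tile."""
        self.slot_to_tile[slot] = tile

    def home_tile(self, vaddr):
        """All lines of a page share the home tile chosen for that page."""
        vpn = vaddr // PAGE_SIZE
        return self.slot_to_tile[self.page_to_slot[vpn]]

    def is_remote(self, vaddr, requesting_tile):
        """True if the access must go to another tile's cache."""
        return self.home_tile(vaddr) != requesting_tile


mapper = TileMapper(num_tiles=4)
mapper.map_page(vpn=0, slot=2)
mapper.map_page(vpn=1, slot=3)
```

In this model, an access from tile 0 to address 100 would be a remote cache access, because page 0 is homed on tile 2; since each line has a single home, no coherence protocol between copies is modelled.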
The proposed tiled CMP architecture is evaluated on the SPLASH-2 scientific benchmarks
and ALPBench multimedia benchmarks against one with private caches and a distributed directory
cache coherence mechanism. Experimental results show that the performance degradation
is as little as 0%, and 16% on average, compared to the cache-coherent architecture across all
benchmarks for 16 and 32 processors.
Patterns and Rewrite Rules for Systematic Code Generation (From High-Level Functional Patterns to High-Performance OpenCL Code)
Computing systems have become increasingly complex with the emergence of
heterogeneous hardware combining multicore CPUs and GPUs. These parallel
systems exhibit tremendous computational power at the cost of increased
programming effort. This results in a tension between achieving performance and
code portability. Code is either tuned using device-specific optimizations to
achieve maximum performance or is written in a high-level language to achieve
portability at the expense of performance.
We propose a novel approach that offers high-level programming, code
portability and high performance. It is based on algorithmic pattern
composition coupled with a powerful, yet simple, set of rewrite rules. This
enables systematic transformation and optimization of a high-level program into
a low-level hardware-specific representation, which leads to high-performance
code.
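As a rough illustration of pattern composition with rewrite rules, the sketch below applies the well-known map-fusion rule, map(f) ∘ map(g) → map(f ∘ g), to a tiny expression tree. It is an assumption-laden toy, not the paper's compiler or rule set: the `Map` and `Compose` classes and the `evaluate` and `rewrite` functions are hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import Any, Callable


@dataclass(frozen=True)
class Map:
    """Pattern: apply f to every element (hypothetical representation)."""
    f: Callable[[Any], Any]


@dataclass(frozen=True)
class Compose:
    """Pattern composition: run inner first, then outer."""
    outer: Any
    inner: Any


def evaluate(expr, xs):
    """Reference semantics of the two patterns on a Python list."""
    if isinstance(expr, Map):
        return [expr.f(x) for x in xs]
    if isinstance(expr, Compose):
        return evaluate(expr.outer, evaluate(expr.inner, xs))
    raise TypeError(f"unknown pattern: {expr!r}")


def rewrite(expr):
    """Apply map fusion bottom-up: map(f) . map(g) -> map(f . g)."""
    if isinstance(expr, Compose):
        outer, inner = rewrite(expr.outer), rewrite(expr.inner)
        if isinstance(outer, Map) and isinstance(inner, Map):
            f, g = outer.f, inner.f
            return Map(lambda x: f(g(x)))  # one traversal instead of two
        return Compose(outer, inner)
    return expr


program = Compose(Map(lambda x: x + 1), Map(lambda x: x * 2))
fused = rewrite(program)
```

The rewritten expression computes the same result as the original while traversing the data once, which is the kind of semantics-preserving transformation a rule-based optimizer relies on.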
We test our design in practice by describing a subset of the OpenCL
programming model with low-level patterns and by implementing a compiler which
generates high-performance OpenCL code. Our experiments show that we can
systematically derive high-performance device-specific implementations from
simple high-level algorithmic expressions. The performance of the generated
OpenCL code is on par with highly tuned implementations for multicore CPUs and
GPUs written by experts.
Comment: Technical Report